Conversation

rhshadrach (Member):
This takes a bit of a perf hit with SeriesGroupBy.value_counts; I think this is because the current implementation always sorts the groupers (which is also fixed here).

# group boundaries are where group ids change
idchanges = 1 + np.nonzero(ids[1:] != ids[:-1])[0]
idx = np.r_[0, idchanges]
if not len(ids):
    idx = idchanges
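As a small illustration (not part of the PR itself), the boundary computation above can be checked on a toy sorted array of group ids; `ids` here stands in for the integer codes produced by the grouper:

```python
import numpy as np

# Sorted group ids, as the grouper would produce after factorization.
ids = np.array([0, 0, 1, 1, 1, 2])

# Group boundaries are the positions where the id changes.
idchanges = 1 + np.nonzero(ids[1:] != ids[:-1])[0]
idx = np.r_[0, idchanges]
if not len(ids):
    # An empty input has no leading boundary at position 0.
    idx = idchanges

print(idx.tolist())  # → [0, 2, 5]: start positions of the three groups
```

The `len(ids) == 0` branch avoids emitting a spurious boundary at position 0 for an empty input.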

We also get a good perf improvement with categorical dtypes.
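For context, here is a minimal sketch of the `sort=False` behavior the linked bug report is about (the toy frame and column names are made up for illustration): with this fix, `SeriesGroupBy.value_counts(sort=False)` no longer force-sorts each group's counts.

```python
import pandas as pd

# Hypothetical toy frame; names are illustrative only.
df = pd.DataFrame({"g": ["a", "a", "a", "b"], "x": ["y", "z", "z", "y"]})

# With sort=False, the per-group counts are no longer sorted by count.
counts = df.groupby("g")["x"].value_counts(sort=False)
print(counts)

# Regardless of ordering, the counts always sum to the number of rows.
assert counts.sum() == len(df)
```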

Perf code
import numpy as np
import pandas as pd

size = 1000
col1_possible_values = ["".join(np.random.choice(list("ABCDEFGHIJKLMNOPRSTUVWXYZ"), 20)) for _ in range(700000)]
col2_possible_values = ["".join(np.random.choice(list("ABCDEFGHIJKLMNOPRSTUVWXYZ"), 10)) for _ in range(860)]
col1_values = np.random.choice(col1_possible_values, size=size, replace=True)
col2_values = np.random.choice(col2_possible_values, size=size, replace=True)
col3_values = np.random.choice(col2_possible_values, size=size, replace=True)
df = pd.DataFrame(zip(col1_values, col2_values, col3_values), columns=["col1", "col2", "col3"])

print('DataFrameGroupBy - object')
%timeit df.groupby("col1").value_counts()

print('SeriesGroupBy - object')
%timeit df.groupby("col1")["col2"].value_counts()

df['col2'] = df['col2'].astype('category')

print('DataFrameGroupBy - category')
%timeit df.groupby("col1", observed=True)[["col2"]].value_counts()

print('SeriesGroupBy - category')
%timeit df.groupby("col1", observed=True)["col2"].value_counts()
# main
DataFrameGroupBy - object
4.09 ms ± 97.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
SeriesGroupBy - object
1.32 ms ± 7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
DataFrameGroupBy - category
86.1 ms ± 611 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
SeriesGroupBy - category
654 ms ± 7.04 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# This PR
DataFrameGroupBy - object
4.1 ms ± 53.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
SeriesGroupBy - object
2 ms ± 17.5 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
DataFrameGroupBy - category
84.7 ms ± 576 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
SeriesGroupBy - category
83.1 ms ± 565 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

@rhshadrach rhshadrach added Bug Groupby Performance Memory or execution speed performance Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff labels Jan 4, 2023
@rhshadrach rhshadrach added this to the 2.0 milestone Jan 4, 2023
mroeschke (Member) left a comment:
Looks fairly good; just a merge conflict to resolve.

mroeschke (Member) left a comment:

LGTM after the merge conflicts are resolved

…pby_value_counts_sort

# Conflicts:
#	doc/source/whatsnew/v2.0.0.rst
#	pandas/tests/groupby/test_value_counts.py
…rach/pandas into groupby_value_counts_sort

# Conflicts:
#	doc/source/whatsnew/v2.0.0.rst
rhshadrach (Member, Author):

Merging to avoid any further whatsnew conflicts

@rhshadrach rhshadrach merged commit 4f42ecb into pandas-dev:main Jan 11, 2023
@rhshadrach rhshadrach deleted the groupby_value_counts_sort branch January 11, 2023 02:22
Successfully merging this pull request may close these issues.

BUG: SeriesGroupBy.value_counts sorts when sort=False
BUG: very slow groupby(col1)[col2].value_counts() for columns of type 'category'